my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")
Special Data Types: Strings + Dates
Monday, April 29
Today we will…
- Reminder: no class on Wednesday!
- Comments from Week 4
- New Material
- String Variables
- Regular Expressions
- PA 5.1: Scrambled Message
Comments from Week 4
| Read | Think |
|---|---|
| Average | summarize(avg = mean()) |
| Total | summarize(total = sum()) |
| Which or For Each | group_by() |
| Minimum | slice_min() |
| Maximum | slice_max() |
String Variables
What is a string?
A string is a bunch of characters.
There is a difference between…
…a string (many characters, one object)…
and
…a character vector (vector of strings).
. . .
my_string
[1] "Hi, my name is Bond!"
my_vector
[1] "Hi" "my" "name" "is" "Bond"
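To make the distinction concrete, here is a small sketch (not part of the original slides) comparing length(), which counts the elements of an object, with str_length(), which counts the characters inside each element.

```r
library(stringr)

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")

# length() counts elements: one string vs. five strings
n_string_elements <- length(my_string)
n_vector_elements <- length(my_vector)

# str_length() counts characters within each element
chars_in_string <- str_length(my_string)
chars_in_vector <- str_length(my_vector)
```

A single string is still a character vector; it just has length one.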
stringr
Common tasks
- Identify strings containing a particular pattern.
- Remove or replace a pattern.
- Edit a string (e.g., make it lowercase).
knitr::include_graphics("https://github.com/rstudio/hex-stickers/blob/main/PNG/stringr.png?raw=true")
- The stringr package loads with tidyverse.
- All functions are of the form str_xxx().
pattern =
The pattern argument appears in many stringr functions.
- The pattern must be supplied inside quotes.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "Bond")
str_match(my_vector, pattern = "Bond")
str_extract(my_vector, pattern = "Bond")
str_subset(my_vector, pattern = "Bond")
. . .
Let’s explore these functions!
str_detect()
Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE TRUE TRUE
. . .
- Pairs well with filter().
- Works with summarise() + sum or mean.
. . .
str_which() returns the indexes of the strings that contain a match.
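As a rough illustration of those pairings, the sketch below uses a small made-up tibble (agents, with hypothetical names) to filter rows, count and average matches, and find matching positions.

```r
library(dplyr)
library(stringr)

# Hypothetical data, invented for this example
agents <- tibble(name = c("James Bond", "Alec Trevelyan",
                          "Jane Bond", "Eve Moneypenny"))

# filter() keeps the rows where str_detect() is TRUE
bonds <- agents |>
  filter(str_detect(name, pattern = "Bond"))

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
bond_summary <- agents |>
  summarise(n_bond    = sum(str_detect(name, pattern = "Bond")),
            prop_bond = mean(str_detect(name, pattern = "Bond")))

# str_which() returns positions instead of TRUE / FALSE
bond_index <- str_which(agents$name, pattern = "Bond")
```

Note that the first argument to str_detect() is the column name, not the dataset.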
str_match()
Returns a character matrix containing either NA or the pattern, depending on if the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "Bond")
     [,1]
[1,] NA
[2,] NA
[3,] "Bond"
[4,] "Bond"
. . .
The matrix will have more columns if you use regex groups.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "(.)o(.)")
     [,1]  [,2] [,3]
[1,] "lo," "l"  ","
[2,] NA    NA   NA
[3,] "Bon" "B"  "n"
[4,] "Bon" "B"  "n"
str_extract()
Returns a character vector with either NA or the pattern, depending on if the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA NA "Bond" "Bond"
. . .
str_extract() only returns the first pattern match.
Use str_extract_all() to return every pattern match.
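A short sketch of the difference, using a made-up input in which the pattern appears twice in one string:

```r
library(stringr)

x <- c("Bond, James Bond", "Hello")

# str_extract(): a character vector, one value (or NA) per string
first_match <- str_extract(x, pattern = "Bond")

# str_extract_all(): a list, one character vector per string
all_matches <- str_extract_all(x, pattern = "Bond")
```

Because str_extract_all() returns a list, a string with no match gives an empty character vector rather than NA.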
str_locate()
Returns an integer matrix with two columns – the starting and ending location of the pattern.
- The values are NA if the pattern is not found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10
. . .
str_sub() extracts values based on a starting and ending location.
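One way the two functions can work together, sketched under the assumption that you want to pull out whatever str_locate() found:

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# Matrix of start / end positions, one row per string
loc <- str_locate(my_vector, pattern = "Bond")

# Feed the positions back into str_sub(); NA positions give NA
found <- str_sub(my_vector, start = loc[, "start"], end = loc[, "end"])

# Positions can also be typed in directly
james <- str_sub("James Bond", start = 1, end = 5)
```

For this particular task str_extract() is simpler; str_sub() is most useful when the positions are known in advance.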
str_subset()
Returns a character vector containing a subset of the original character vector consisting of the elements where the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond" "James Bond"
Try it out!
my_vector <- c("I scream,", "you scream", "we all",
"scream","for","ice cream")
str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")
For each of these functions, write down:
- the object structure of the output.
- the data type of the output.
- a brief explanation of what they do.
Replace / Remove Patterns
Replace the first matched pattern in each string.
- Pairs well with mutate().
str_replace(my_vector, pattern = "Bond", replacement = "Franco")
[1] "Hello," "my name is" "Franco" "James Franco"
str_replace_all() replaces all matched patterns in each string.
Remove the first matched pattern in each string.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_remove(my_vector, pattern = "Bond")
[1] "Hello," "my name is" "" "James "
This is a special case of str_replace(x, pattern, replacement = "").
str_remove_all() removes all matched patterns in each string.
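The "first match" versus "all matches" distinction only shows up when a pattern occurs more than once in a string, so here is a sketch with a made-up input where it does:

```r
library(stringr)

x <- "Bond, James Bond"

# Only the first "Bond" is replaced
first_only <- str_replace(x, pattern = "Bond", replacement = "Franco")

# Every "Bond" is replaced
every_one <- str_replace_all(x, pattern = "Bond", replacement = "Franco")

# Every "Bond" is removed
removed_all <- str_remove_all(x, pattern = "Bond")
```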
Edit Strings
Convert letters in a string to a specific capitalization format.
str_to_lower() converts all letters in a string to lowercase.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_to_lower(my_vector)
[1] "hello," "my name is" "bond" "james bond"
str_to_upper() converts all letters in a string to uppercase.
str_to_upper(my_vector)
[1] "HELLO," "MY NAME IS" "BOND" "JAMES BOND"
str_to_title() converts the first letter of each word to uppercase.
str_to_title(my_vector)
[1] "Hello," "My Name Is" "Bond" "James Bond"
Combine Strings
Join multiple strings into a single string.
prompt <- "Hello, my name is"
first <- "James"
last <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"
Similar to paste() and paste0().
Combine a vector of strings into a single string.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_flatten(my_vector, collapse = " ")
[1] "Hello, my name is Bond James Bond"
str_c() will do the same thing, but you should use str_flatten() instead!
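A brief sketch of how the two relate: str_c() needs its collapse argument to reduce a vector to one string, and without it works element by element.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# Both reduce the vector to a single string
flattened <- str_flatten(my_vector, collapse = " ")
collapsed <- str_c(my_vector, collapse = " ")

# Without collapse, str_c() joins element-wise and keeps a vector
pairwise <- str_c(c("James", "Jane"), "Bond", sep = " ")
```

Using str_flatten() makes the intent (many strings in, one string out) explicit.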
Use variables in the environment to create a string based on {expressions}.
first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond
See the R package glue!
Tips for String Success
- Refer to the stringr cheatsheet.
- Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!
  - You will use these functions inside dplyr verbs like filter() or mutate().
| name | is_bran | manuf | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | TRUE | N | cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.00 | 0.33 | 68.40297 |
| 100% Natural Bran | TRUE | Q | cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.00 | 1.00 | 33.98368 |
| All-Bran | TRUE | K | cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.00 | 0.33 | 59.42551 |
| All-Bran with Extra Fiber | TRUE | K | cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.00 | 0.50 | 93.70491 |
| Almond Delight | FALSE | R | cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.00 | 0.75 | 34.38484 |
| Apple Cinnamon Cheerios | FALSE | G | cold | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.00 | 0.75 | 29.50954 |
| Apple Jacks | FALSE | K | cold | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.00 | 1.00 | 33.17409 |
| Basic 4 | FALSE | G | cold | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.03856 |
| Bran Chex | TRUE | R | cold | 90 | 2 | 1 | 200 | 4.0 | 15.0 | 6 | 125 | 25 | 1 | 1.00 | 0.67 | 49.12025 |
| Bran Flakes | TRUE | P | cold | 90 | 3 | 0 | 210 | 5.0 | 13.0 | 5 | 190 | 25 | 3 | 1.00 | 0.67 | 53.31381 |
| Cap'n'Crunch | FALSE | Q | cold | 120 | 1 | 2 | 220 | 0.0 | 12.0 | 12 | 35 | 25 | 2 | 1.00 | 0.75 | 18.04285 |
| Cheerios | FALSE | G | cold | 110 | 6 | 2 | 290 | 2.0 | 17.0 | 1 | 105 | 25 | 1 | 1.00 | 1.25 | 50.76500 |
| Cinnamon Toast Crunch | FALSE | G | cold | 120 | 1 | 3 | 210 | 0.0 | 13.0 | 9 | 45 | 25 | 2 | 1.00 | 0.75 | 19.82357 |
| Clusters | FALSE | G | cold | 110 | 3 | 2 | 140 | 2.0 | 13.0 | 7 | 105 | 25 | 3 | 1.00 | 0.50 | 40.40021 |
| Cocoa Puffs | FALSE | G | cold | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 55 | 25 | 2 | 1.00 | 1.00 | 22.73645 |
| Corn Chex | FALSE | R | cold | 110 | 2 | 0 | 280 | 0.0 | 22.0 | 3 | 25 | 25 | 1 | 1.00 | 1.00 | 41.44502 |
| Corn Flakes | FALSE | K | cold | 100 | 2 | 0 | 290 | 1.0 | 21.0 | 2 | 35 | 25 | 1 | 1.00 | 1.00 | 45.86332 |
| Corn Pops | FALSE | K | cold | 110 | 1 | 0 | 90 | 1.0 | 13.0 | 12 | 20 | 25 | 2 | 1.00 | 1.00 | 35.78279 |
| Count Chocula | FALSE | G | cold | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 65 | 25 | 2 | 1.00 | 1.00 | 22.39651 |
| Cracklin' Oat Bran | TRUE | K | cold | 110 | 3 | 3 | 140 | 4.0 | 10.0 | 7 | 160 | 25 | 3 | 1.00 | 0.50 | 40.44877 |
| Cream of Wheat (Quick) | FALSE | N | hot | 100 | 3 | 0 | 80 | 1.0 | 21.0 | 0 | -1 | 0 | 2 | 1.00 | 1.00 | 64.53382 |
| Crispix | FALSE | K | cold | 110 | 2 | 0 | 220 | 1.0 | 21.0 | 3 | 30 | 25 | 3 | 1.00 | 1.00 | 46.89564 |
| Crispy Wheat & Raisins | FALSE | G | cold | 100 | 2 | 1 | 140 | 2.0 | 11.0 | 10 | 120 | 25 | 3 | 1.00 | 0.75 | 36.17620 |
| Double Chex | FALSE | R | cold | 100 | 2 | 0 | 190 | 1.0 | 18.0 | 5 | 80 | 25 | 3 | 1.00 | 0.75 | 44.33086 |
| Froot Loops | FALSE | K | cold | 110 | 2 | 1 | 125 | 1.0 | 11.0 | 13 | 30 | 25 | 2 | 1.00 | 1.00 | 32.20758 |
| Frosted Flakes | FALSE | K | cold | 110 | 1 | 0 | 200 | 1.0 | 14.0 | 11 | 25 | 25 | 1 | 1.00 | 0.75 | 31.43597 |
| Frosted Mini-Wheats | FALSE | K | cold | 100 | 3 | 0 | 0 | 3.0 | 14.0 | 7 | 100 | 25 | 2 | 1.00 | 0.80 | 58.34514 |
| Fruit & Fibre Dates; Walnuts; and Oats | FALSE | P | cold | 120 | 3 | 2 | 160 | 5.0 | 12.0 | 10 | 200 | 25 | 3 | 1.25 | 0.67 | 40.91705 |
| Fruitful Bran | TRUE | K | cold | 120 | 3 | 0 | 240 | 5.0 | 14.0 | 12 | 190 | 25 | 3 | 1.33 | 0.67 | 41.01549 |
| Fruity Pebbles | FALSE | P | cold | 110 | 1 | 1 | 135 | 0.0 | 13.0 | 12 | 25 | 25 | 2 | 1.00 | 0.75 | 28.02576 |
| Golden Crisp | FALSE | P | cold | 100 | 2 | 0 | 45 | 0.0 | 11.0 | 15 | 40 | 25 | 1 | 1.00 | 0.88 | 35.25244 |
| Golden Grahams | FALSE | G | cold | 110 | 1 | 1 | 280 | 0.0 | 15.0 | 9 | 45 | 25 | 2 | 1.00 | 0.75 | 23.80404 |
| Grape Nuts Flakes | FALSE | P | cold | 100 | 3 | 1 | 140 | 3.0 | 15.0 | 5 | 85 | 25 | 3 | 1.00 | 0.88 | 52.07690 |
| Grape-Nuts | FALSE | P | cold | 110 | 3 | 0 | 170 | 3.0 | 17.0 | 3 | 90 | 25 | 3 | 1.00 | 0.25 | 53.37101 |
| Great Grains Pecan | FALSE | P | cold | 120 | 3 | 3 | 75 | 3.0 | 13.0 | 4 | 100 | 25 | 3 | 1.00 | 0.33 | 45.81172 |
| Honey Graham Ohs | FALSE | Q | cold | 120 | 1 | 2 | 220 | 1.0 | 12.0 | 11 | 45 | 25 | 2 | 1.00 | 1.00 | 21.87129 |
| Honey Nut Cheerios | FALSE | G | cold | 110 | 3 | 1 | 250 | 1.5 | 11.5 | 10 | 90 | 25 | 1 | 1.00 | 0.75 | 31.07222 |
| Honey-comb | FALSE | P | cold | 110 | 1 | 0 | 180 | 0.0 | 14.0 | 11 | 35 | 25 | 1 | 1.00 | 1.33 | 28.74241 |
| Just Right Crunchy Nuggets | FALSE | K | cold | 110 | 2 | 1 | 170 | 1.0 | 17.0 | 6 | 60 | 100 | 3 | 1.00 | 1.00 | 36.52368 |
| Just Right Fruit & Nut | FALSE | K | cold | 140 | 3 | 1 | 170 | 2.0 | 20.0 | 9 | 95 | 100 | 3 | 1.30 | 0.75 | 36.47151 |
| Kix | FALSE | G | cold | 110 | 2 | 1 | 260 | 0.0 | 21.0 | 3 | 40 | 25 | 2 | 1.00 | 1.50 | 39.24111 |
| Life | FALSE | Q | cold | 100 | 4 | 2 | 150 | 2.0 | 12.0 | 6 | 95 | 25 | 2 | 1.00 | 0.67 | 45.32807 |
| Lucky Charms | FALSE | G | cold | 110 | 2 | 1 | 180 | 0.0 | 12.0 | 12 | 55 | 25 | 2 | 1.00 | 1.00 | 26.73451 |
| Maypo | FALSE | A | hot | 100 | 4 | 1 | 0 | 0.0 | 16.0 | 3 | 95 | 25 | 2 | 1.00 | 1.00 | 54.85092 |
| Muesli Raisins; Dates; & Almonds | FALSE | R | cold | 150 | 4 | 3 | 95 | 3.0 | 16.0 | 11 | 170 | 25 | 3 | 1.00 | 1.00 | 37.13686 |
| Muesli Raisins; Peaches; & Pecans | FALSE | R | cold | 150 | 4 | 3 | 150 | 3.0 | 16.0 | 11 | 170 | 25 | 3 | 1.00 | 1.00 | 34.13976 |
| Mueslix Crispy Blend | FALSE | K | cold | 160 | 3 | 2 | 150 | 3.0 | 17.0 | 13 | 160 | 25 | 3 | 1.50 | 0.67 | 30.31335 |
| Multi-Grain Cheerios | FALSE | G | cold | 100 | 2 | 1 | 220 | 2.0 | 15.0 | 6 | 90 | 25 | 1 | 1.00 | 1.00 | 40.10596 |
| Nut&Honey Crunch | FALSE | K | cold | 120 | 2 | 1 | 190 | 0.0 | 15.0 | 9 | 40 | 25 | 2 | 1.00 | 0.67 | 29.92429 |
| Nutri-Grain Almond-Raisin | FALSE | K | cold | 140 | 3 | 2 | 220 | 3.0 | 21.0 | 7 | 130 | 25 | 3 | 1.33 | 0.67 | 40.69232 |
| Nutri-grain Wheat | FALSE | K | cold | 90 | 3 | 0 | 170 | 3.0 | 18.0 | 2 | 90 | 25 | 3 | 1.00 | 1.00 | 59.64284 |
| Oatmeal Raisin Crisp | FALSE | G | cold | 130 | 3 | 2 | 170 | 1.5 | 13.5 | 10 | 120 | 25 | 3 | 1.25 | 0.50 | 30.45084 |
| Post Nat. Raisin Bran | TRUE | P | cold | 120 | 3 | 1 | 200 | 6.0 | 11.0 | 14 | 260 | 25 | 3 | 1.33 | 0.67 | 37.84059 |
| Product 19 | FALSE | K | cold | 100 | 3 | 0 | 320 | 1.0 | 20.0 | 3 | 45 | 100 | 3 | 1.00 | 1.00 | 41.50354 |
| Puffed Rice | FALSE | Q | cold | 50 | 1 | 0 | 0 | 0.0 | 13.0 | 0 | 15 | 0 | 3 | 0.50 | 1.00 | 60.75611 |
| Puffed Wheat | FALSE | Q | cold | 50 | 2 | 0 | 0 | 1.0 | 10.0 | 0 | 50 | 0 | 3 | 0.50 | 1.00 | 63.00565 |
| Quaker Oat Squares | FALSE | Q | cold | 100 | 4 | 1 | 135 | 2.0 | 14.0 | 6 | 110 | 25 | 3 | 1.00 | 0.50 | 49.51187 |
| Quaker Oatmeal | FALSE | Q | hot | 100 | 5 | 2 | 0 | 2.7 | -1.0 | -1 | 110 | 0 | 1 | 1.00 | 0.67 | 50.82839 |
| Raisin Bran | TRUE | K | cold | 120 | 3 | 1 | 210 | 5.0 | 14.0 | 12 | 240 | 25 | 2 | 1.33 | 0.75 | 39.25920 |
| Raisin Nut Bran | TRUE | G | cold | 100 | 3 | 2 | 140 | 2.5 | 10.5 | 8 | 140 | 25 | 3 | 1.00 | 0.50 | 39.70340 |
| Raisin Squares | FALSE | K | cold | 90 | 2 | 0 | 0 | 2.0 | 15.0 | 6 | 110 | 25 | 3 | 1.00 | 0.50 | 55.33314 |
| Rice Chex | FALSE | R | cold | 110 | 1 | 0 | 240 | 0.0 | 23.0 | 2 | 30 | 25 | 1 | 1.00 | 1.13 | 41.99893 |
| Rice Krispies | FALSE | K | cold | 110 | 2 | 0 | 290 | 0.0 | 22.0 | 3 | 35 | 25 | 1 | 1.00 | 1.00 | 40.56016 |
| Shredded Wheat | FALSE | N | cold | 80 | 2 | 0 | 0 | 3.0 | 16.0 | 0 | 95 | 0 | 1 | 0.83 | 1.00 | 68.23588 |
| Shredded Wheat 'n'Bran | TRUE | N | cold | 90 | 3 | 0 | 0 | 4.0 | 19.0 | 0 | 140 | 0 | 1 | 1.00 | 0.67 | 74.47295 |
| Shredded Wheat spoon size | FALSE | N | cold | 90 | 3 | 0 | 0 | 3.0 | 20.0 | 0 | 120 | 0 | 1 | 1.00 | 0.67 | 72.80179 |
| Smacks | FALSE | K | cold | 110 | 2 | 1 | 70 | 1.0 | 9.0 | 15 | 40 | 25 | 2 | 1.00 | 0.75 | 31.23005 |
| Special K | FALSE | K | cold | 110 | 6 | 0 | 230 | 1.0 | 16.0 | 3 | 55 | 25 | 1 | 1.00 | 1.00 | 53.13132 |
| Strawberry Fruit Wheats | FALSE | N | cold | 90 | 2 | 0 | 15 | 3.0 | 15.0 | 5 | 90 | 25 | 2 | 1.00 | 1.00 | 59.36399 |
| Total Corn Flakes | FALSE | G | cold | 110 | 2 | 1 | 200 | 0.0 | 21.0 | 3 | 35 | 100 | 3 | 1.00 | 1.00 | 38.83975 |
| Total Raisin Bran | TRUE | G | cold | 140 | 3 | 1 | 190 | 4.0 | 15.0 | 14 | 230 | 100 | 3 | 1.50 | 1.00 | 28.59278 |
| Total Whole Grain | FALSE | G | cold | 100 | 3 | 1 | 200 | 3.0 | 16.0 | 3 | 110 | 100 | 3 | 1.00 | 1.00 | 46.65884 |
| Triples | FALSE | G | cold | 110 | 2 | 1 | 250 | 0.0 | 21.0 | 3 | 60 | 25 | 3 | 1.00 | 0.75 | 39.10617 |
| Trix | FALSE | G | cold | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 25 | 25 | 2 | 1.00 | 1.00 | 27.75330 |
| Wheat Chex | FALSE | R | cold | 100 | 3 | 1 | 230 | 3.0 | 17.0 | 3 | 115 | 25 | 1 | 1.00 | 0.67 | 49.78744 |
| Wheaties | FALSE | G | cold | 100 | 3 | 1 | 200 | 3.0 | 17.0 | 3 | 110 | 25 | 1 | 1.00 | 1.00 | 51.59219 |
| Wheaties Honey Gold | FALSE | G | cold | 110 | 2 | 1 | 200 | 1.0 | 16.0 | 8 | 60 | 25 | 1 | 1.00 | 0.75 | 36.18756 |
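The is_bran column in the table above is consistent with a call like the one below. The code behind the slide is not shown, so this is an assumed reconstruction on a hypothetical four-row subset (cereals_small).

```r
library(dplyr)
library(stringr)

# Hypothetical subset of the cereal names shown in the table
cereals_small <- tibble(name = c("100% Bran", "Almond Delight",
                                 "Bran Chex", "Cheerios"))

# str_detect() receives the name column, not the whole dataset
cereals_flagged <- cereals_small |>
  mutate(is_bran = str_detect(name, pattern = "Bran"))

# The new logical column can then drive filter()
bran_only <- cereals_flagged |>
  filter(is_bran)
```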
Tips for String Success
The real power of these str_xxx functions comes when you specify the pattern using regular expressions!
knitr::include_graphics("images/regular_expressions.png")regex
Regular Expressions
“Regexps are a very terse language that allow you to describe patterns in strings.”
R for Data Science
. . .
Use str_xxx functions + regular expressions!
. . .
You might encounter gsub(), grep(), etc. from Base R.
Regular Expressions
Regular expressions are tricky!
- There are lots of new symbols to keep straight.
- There are a lot of cases to think through.
This web app for testing R regular expressions might be handy!
Special Characters
There is a set of characters that have a specific meaning when using regex.
- The stringr package does not read these as normal characters.
- These characters are: . ^ $ \ | * + ? { } [ ] ( )
Wild Card Character: .
. – matches any character.
x <- c("She", "sells", "seashells", "by", "the", "seashore!")
str_subset(x, pattern = ".ells")
[1] "sells" "seashells"
This matches strings that contain any character followed by “ells”.
Anchor Characters: ^ $
^ – looks at the beginning of a string.
x <- c("She", "sells", "seashells", "by", "the", "seashore!")
str_subset(x, pattern = "^s")
[1] "sells" "seashells" "seashore!"
This matches strings that start with “s”.
. . .
$ – looks at the end of a string.
str_subset(x, pattern = "s$")
[1] "sells" "seashells"
This matches strings that end with “s”.
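The two anchors can be combined to require that the whole string matches. A small sketch with made-up words:

```r
library(stringr)

words <- c("the", "then", "bathe")

# No anchors: "the" may appear anywhere in the string
anywhere <- str_subset(words, pattern = "the")

# Both anchors: the string must be exactly "the"
exact <- str_subset(words, pattern = "^the$")
```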
Quantifier Characters: ? + *
? – matches when the preceding character occurs 0 or 1 times in a row.
x <- c("shes", "shels", "shells", "shellls", "shelllls")
str_subset(x, pattern = "shel?s")
[1] "shes" "shels"
. . .
+ – … occurs 1 or more times in a row.
str_subset(x, pattern = "shel+s")
[1] "shels" "shells" "shellls" "shelllls"
. . .
* – … occurs 0 or more times in a row.
str_subset(x, pattern = "shel*s")
[1] "shes" "shels" "shells" "shellls" "shelllls"
Quantifier Characters: {}
{n} – matches when the preceding character occurs exactly n times in a row.
x <- c("shes", "shels", "shells", "shellls", "shelllls")
str_subset(x, pattern = "shel{2}s")
[1] "shells"
. . .
{n,} – … occurs at least n times in a row.
str_subset(x, pattern = "shel{2,}s")
[1] "shells" "shellls" "shelllls"
. . .
{n,m} – … occurs between n and m times in a row.
str_subset(x, pattern = "shel{1,3}s")
[1] "shels" "shells" "shellls"
Character Groups: ()
Groups are created with ( ).
- We can specify “either” / “or” within a group using |.
x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p(e|i)ck")
[1] "picked" "peck" "pickled"
This matches strings that contain either “peck” or “pick”.
Character Classes: []
Character classes let you specify multiple possible characters to match on.
x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p[ei]ck")
[1] "picked" "peck" "pickled"
. . .
[^ ] – specifies characters not to match on (think except)
str_subset(x, pattern = "p[^i]ck")
[1] "peck"
. . .
[Pp] – capitalization matters!
str_subset(x, pattern = "^p")
[1] "picked" "peck" "pickled" "peppers!"
str_subset(x, pattern = "^[Pp]")
[1] "Peter" "Piper" "picked" "peck" "pickled" "peppers!"
Character Classes: []
[ - ] – specifies a range of characters.
x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p[ei]ck[a-z]")
[1] "picked" "pickled"
. . .
- [A-Z] matches any capital letter.
- [a-z] matches any lowercase letter.
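- [A-Za-z] or [:alpha:] matches any letter. (Avoid [A-z]: it also matches the punctuation characters that fall between Z and a, such as ^ and _.)
- [0-9] or [:digit:] matches any digit.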
- See the stringr cheatsheet for more shortcuts, like [:punct:]
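Ranges follow character order, so it is worth checking what actually falls inside one. The sketch below (test characters chosen for illustration) compares the single range A-z with the two separate letter ranges.

```r
library(stringr)

chars <- c("a", "Z", "_", "^", "5")

# One range from A to z also spans the symbols between Z and a
wide_range <- str_detect(chars, pattern = "[A-z]")

# Two ranges cover only uppercase and lowercase letters
letters_only <- str_detect(chars, pattern = "[A-Za-z]")
```

Here "_" and "^" are matched by the wide range but not by the two-range class.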
Shortcuts
\\w – matches any “word” character (\\W matches any non-“word” character)
- “Word” characters are letters, digits, and the underscore.
\\d – matches any digit (\\D matches not digit)
\\s – matches any whitespace (\\S matches not whitespace)
- Whitespace includes spaces, tabs, newlines, etc.
. . .
x <- "phone number: 1234567899"
str_extract(x, pattern = "\\d+")
[1] "1234567899"
str_extract_all(x, pattern = "\\S+")
[[1]]
[1] "phone" "number:" "1234567899"
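The negated shortcuts are handy for cleaning. As a sketch, with a made-up phone number that includes dashes:

```r
library(stringr)

x <- "phone number: 123-456-7899"

# Remove everything that is not a digit
digits_only <- str_remove_all(x, pattern = "\\D")

# Pull out each run of "word" characters
words <- str_extract_all(x, pattern = "\\w+")[[1]]
```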
Try it out!
What regular expressions would match words that…
- end with a vowel?
- start with x, y, or z?
- do not contain x, y, or z?
- contain British spelling?
x <- c("zebra", "xray", "apple", "yellow",
       "color", "colour", "summarize", "summarise")
str_subset(x, "[aeiouy]$")    # end with a vowel (counting y)
str_subset(x, "^[xyz]")       # start with x, y, or z
str_subset(x, "^[^xyz]+$")    # do not contain x, y, or z
str_subset(x, "(our)|(ise)")  # contain British spelling
Escape: \\
To match a special character, you need to escape it.
x <- c("How", "much", "wood", "could", "a", "woodchuck", "chuck",
"if", "a", "woodchuck", "could", "chuck","wood?")
str_subset(x, pattern = "?")
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
. . .
Use \\ to escape the ? – it is now read as a normal character.
str_subset(x, pattern = "\\?")
[1] "wood?"
. . .
Alternatively, you could use []:
str_subset(x, pattern = "[?]")
[1] "wood?"
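Two backslashes are needed because R reads the string first and the regex engine reads it second. The sketch below checks how many characters the pattern really contains, and also shows fixed(), a stringr helper that treats the whole pattern as literal text (its use here is a suggestion, not something from the slides).

```r
library(stringr)

x <- c("wood", "wood?", "would")

# R turns "\\?" into two characters: a backslash and a question mark
pattern_string <- "\\?"
n_chars <- str_length(pattern_string)

# The regex engine then reads \? as a literal question mark
escaped <- str_subset(x, pattern = pattern_string)

# fixed() skips regex interpretation entirely
literal <- str_subset(x, pattern = fixed("?"))
```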
When in Doubt
knitr::include_graphics("images/backslashes.png")
Use the web app to test R regular expressions.
Tips for working with regex
- Read the regular expressions out loud like a request.
. . .
- Test out your expressions on small examples first.
str_view()
str_view(c("shes", "shels", "shells", "shellls", "shelllls"), "l+")
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s
. . .
- Use the stringr cheatsheet.
. . .
- Be kind to yourself when working with regular expressions!
Strings in the tidyverse
stringr functions + dplyr verbs!
Country names with a (capital or lowercase) “Z”?
| Country |
|---|
| Mozambique |
| Tanzania |
| Zambia |
| Zimbabwe |
| Belize |
| Brazil |
| Venezuela |
| Kazakhstan |
| Kyrgyzstan |
| Uzbekistan |
| New Zealand |
| Bosnia-Herzegovina |
| Czechia |
| Czechoslovakia |
| Azerbaijan |
| Switzerland |
. . .
The proportion of country names with a compass direction?
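Both questions can be approached with str_detect() inside a dplyr verb. The sketch below uses a hypothetical six-row tibble (countries); the dataset behind the slide is not shown, so the results are for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical data, invented for this example
countries <- tibble(Country = c("New Zealand", "Brazil", "South Africa",
                                "North Macedonia", "Peru", "Kenya"))

# A character class handles capital or lowercase "Z"
with_z <- countries |>
  filter(str_detect(Country, pattern = "[Zz]"))

# mean() of a logical vector is the proportion of TRUEs
compass <- countries |>
  summarise(prop = mean(str_detect(Country,
                                   pattern = "North|South|East|West")))
```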
matches(pattern)
Select all variables with a name that matches the supplied pattern.
- Pairs well with select(), rename_with(), and across().
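When selecting year-named columns, the digit class matters. This sketch uses a hypothetical tibble (spending) to show that a class of 1 through 9 skips any year containing a zero, while 0 through 9 catches them all.

```r
library(dplyr)

# Hypothetical data with two year columns and one non-year column
spending <- tibble(Country = c("A", "B"),
                   `1989`  = c("10", ". ."),
                   `2000`  = c("xxx", "12"),
                   Notes   = c("", ""))

# [1-9] excludes 0, so "2000" is not selected
year_cols_strict <- spending |>
  select(matches("[1-9]{4}")) |>
  names()

# [0-9] allows any digit; anchors require the whole name to be 4 digits
year_cols_all <- spending |>
  select(matches("^[0-9]{4}$")) |>
  names()

# matches() works the same way inside across()
cleaned <- spending |>
  mutate(across(matches("^[0-9]{4}$"), ~ na_if(.x, ". .")))
```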
military_clean <- military |>
  mutate(across(`1988`:`2019`,
                ~ na_if(.x, y = ". .")),
         across(`1988`:`2019`,
                ~ na_if(.x, y = "xxx")))
military_clean <- military |>
  mutate(across(matches("[0-9]{4}"),
                ~ na_if(.x, y = ". .")),
         across(matches("[0-9]{4}"),
                ~ na_if(.x, y = "xxx")))
Messy Covid Variants!
What is that variable?!
[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]
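One possible way to pull structure out of a value like this is a regex with a capture group. The sketch below works on a shortened copy of the string (raw) and is only an illustration, not necessarily how the lecture handled it.

```r
library(stringr)

# Shortened copy of the value shown above
raw <- "[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}]"

# Capture whatever sits between the quotes after 'variant':
variant_matches <- str_match_all(raw, pattern = "'variant': '([^']+)'")[[1]]
variant_names <- variant_matches[, 2]

# Capture the number after 'newWeeklyPercentage':
pct_matches <- str_match_all(raw, pattern = "'newWeeklyPercentage': ([0-9.]+)")[[1]]
percentages <- as.numeric(pct_matches[, 2])
```

Column 1 of each match matrix holds the full match; column 2 holds the captured group.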
PA 5.1: Scrambled Message
In this activity, you will use regex to decode a message.
- Remember: stringr functions go inside dplyr verbs like mutate() and filter() – use them like as.factor.
. . .
- Reminder about indexing vectors:
x <- c("She", "sells", "seashells", "by", "the", "seashore!")
- Grab elements out of a vector with [].
x[c(1,4,5)]
[1] "She" "by" "the"
- To replace those elements, use <- to assign new values.
x[c(1,4,5)] <- ""
To do…
- PA 5.1: Scrambled Message
- Due Saturday, 5/4 at 11:59pm
Wednesday, May 1
Today we will…
- Midterm Exam 5/8: What to Expect
- New Material
- Working with Date & Time Variables
- PA 5.2: Jewel Heist
- LA 5: Murder in SQL City
Midterm Exam – Wednesday, 5/8
- This is a three-part exam:
- You will first complete a General Questions section on paper and without your computer.
- After you turn that in, you will complete a Short Answer section with your computer.
- You will have the one hour and 50 minute class period to complete the first two sections.
- The third section, Open-Ended Analysis, will be started in class and due 24 hours after the end of class.
Midterm Exam – Wednesday, 5/8
- The exam is worth approximately 90 points.
- Approx. 20 pts, 30 pts, and 40 pts for the three sections.
- I will provide a .qmd template for the Short Answer.
- You will create your own .qmd for the Open-Ended Analysis. You are encouraged to create this ahead of time.
While the coding tasks are open-resource, you will likely run out of time if you have to look everything up. Know what functions you might need and where to find documentation for implementing these functions!
Date + Time Variables
Why are dates and times tricky?
When parsing dates and times, we have to consider complicating factors like…
- Daylight Savings Time.
- One day a year is 23 hours; one day a year is 25 hours.
- Some places use it, some don’t.
- Leap years – most years have 365 days, some have 366.
- Time zones.
lubridate
Common Tasks
Convert a date-like variable (“May 8, 1995”) to a date or date-time object.
Find the weekday, month, year, etc from a date-time object.
Convert between time zones.
knitr::include_graphics("https://github.com/rstudio/hex-stickers/blob/main/thumbs/lubridate.png?raw=true")
The lubridate package installs with tidyverse. Since tidyverse 2.0.0 it also loads with library(tidyverse); with older versions you need to load it separately.
library(lubridate)
date-time Objects
There are multiple data types for dates and times.
- A date:
  - date or Date
- A date and a time (identifies a unique instant in time):
  - dtm
  - POSIXct – stores date-times as the number of seconds since January 1, 1970 (“Unix Epoch”)
  - POSIXlt – stores date-times as a list with elements for second, minute, hour, day, month, year, etc.
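Base R can show how the two POSIX classes store the same instant: POSIXct as a single number of seconds, POSIXlt as a list of components. A minimal sketch:

```r
# One day after the Unix Epoch, in UTC
t_ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")

# POSIXct is a number underneath: seconds since 1970-01-01 00:00:00 UTC
seconds_since_epoch <- as.numeric(t_ct)

# POSIXlt is a list underneath, with named components
t_lt <- as.POSIXlt(t_ct)
day_of_month     <- t_lt$mday
years_since_1900 <- t_lt$year
```

The compact numeric form is why POSIXct is the usual choice for columns in a data frame.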
Creating date-time Objects
Create a date from individual components:
make_date(year = 1995, month = 05, day = 08)
[1] "1995-05-08"
. . .
Create a date from a string:
mdy("May 8, 1995")
[1] "1995-05-08"
dmy("8-May-1995", tz = "America/Chicago")
[1] "1995-05-08 CDT"
dmy_hms("8-May-1995 9:32:12", tz = "America/Chicago")
[1] "1995-05-08 09:32:12 CDT"
as_datetime("95-05-08", format = "%y-%m-%d")
[1] "1995-05-08 UTC"
parse_datetime("5/8/1995", format = "%m/%d/%Y")
[1] "1995-05-08 UTC"
Creating date-time Objects
Common Mistake with Dates
What’s wrong here?
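as_datetime(2023-02-6)
[1] "1970-01-01 00:33:35 UTC"
my_date <- 2023-02-6
my_date
[1] 2015
. . .
Make sure you use quotes!
- Without quotes, R treats 2023-02-6 as subtraction: 2023 - 2 - 6 = 2015.
- as_datetime() then reads 2015 as seconds after the Unix Epoch: 2,015 seconds \(\approx\) 33.6 minutes.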
Extracting date-time Components
bday <- ymd_hms("1993-11-20 9:32:12", tz = "America/New_York")
bday
[1] "1993-11-20 09:32:12 EST"
year(bday)
[1] 1993
month(bday)
[1] 11
day(bday)
[1] 20
wday(bday)
[1] 7
wday(bday, label = TRUE, abbr = FALSE)
[1] Saturday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
Subtraction with date-time Objects
Doing subtraction gives you a difftime object.
difftime objects do not always have the same units – it depends on the scale of the objects you are working with.
How old am I?
today() - mdy(11201993)
Time difference of 11288 days
How long did it take me to finish a typing challenge?
begin <- mdy_hms("3/1/2023 13:04:34")
finish <- mdy_hms("3/1/2023 13:06:11")
finish - begin
Time difference of 1.616667 mins
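Because the units are chosen automatically, it can help to ask for them explicitly. A sketch using the typing-challenge times from above:

```r
library(lubridate)

begin  <- mdy_hms("3/1/2023 13:04:34")
finish <- mdy_hms("3/1/2023 13:06:11")

# Subtraction picks the units for you (minutes here)
elapsed <- finish - begin
auto_units <- units(elapsed)

# Convert to a plain number in the units you want
elapsed_secs <- as.numeric(elapsed, units = "secs")

# Or request the units up front with difftime()
elapsed_explicit <- difftime(finish, begin, units = "secs")
```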
Durations and Periods
Durations will always give the time span in an exact number of seconds.
as.duration(today() - mdy(11201993))
[1] "975283200s (~30.9 years)"
as.duration(finish - begin)
[1] "97s (~1.62 minutes)"
. . .
Periods will give the time span in more approximate, but human readable times.
as.period(today() - mdy(11201993))
[1] "11288d 0H 0M 0S"
as.period(finish - begin)
[1] "1M 37S"
Durations and Periods
We can also add time:
- days(), years(), etc. will add a period of time.
- ddays(), dyears(), etc. will add a duration of time.
. . .
Because durations use the exact number of seconds to represent days and years, you might get unexpected results:
When is my 99th birthday?
mdy(11201993) + years(99)
[1] "2092-11-20"
mdy(11201993) + dyears(99)
[1] "2092-11-19 18:00:00 UTC"
Time Zones
Time zones are complicated!
Specify time zones in the form:
- {continent}/{city} – “America/New_York”, “Africa/Nairobi”
- {ocean}/{city} – “Pacific/Auckland”
. . .
What time zone does R think I’m in?
Sys.timezone()
[1] "America/Los_Angeles"
Time Zones
You can change the time zone of a date in two ways:
x <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
with_tz() keeps the instant in time the same, but changes the visual representation.
x |>
  with_tz()
[1] "2024-06-01 09:00:00 PDT"
x |>
  with_tz(tzone = "Asia/Kolkata")
[1] "2024-06-01 21:30:00 IST"
force_tz() changes the instant in time by forcing a time zone change.
x |>
  force_tz()
[1] "2024-06-01 18:00:00 PDT"
x |>
  force_tz(tzone = "Asia/Kolkata")
[1] "2024-06-01 18:00:00 IST"
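Comparing the results against the original object shows which function moved the instant. A minimal sketch using the same Copenhagen time:

```r
library(lubridate)

x <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")

shown_in_india  <- with_tz(x, tzone = "Asia/Kolkata")
forced_to_india <- force_tz(x, tzone = "Asia/Kolkata")

# with_tz(): same instant, different clock reading
same_instant <- x == shown_in_india

# force_tz(): same clock reading, different instant
moved_instant <- x == forced_to_india

# Copenhagen is 3.5 hours behind Kolkata on this date,
# so the forced time falls 3.5 hours earlier than x
shift_hours <- as.numeric(difftime(x, forced_to_india, units = "hours"))
```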
Common Mistake with Dates
When you read data in or create a new date-time object, the default time zone (if not specified) is UTC.
- UTC (Coordinated Universal Time) is, for everyday purposes, the same as GMT (Greenwich Mean Time).
Make sure you specify your desired time zone!
x <- mdy("11/20/1993")
tz(x)
[1] "UTC"
x <- mdy("11/20/1993", tz = "America/New_York")
tz(x)
[1] "America/New_York"
PA 5.2: Jewel Heist
Just down the road in Montecito, CA several rare jewels went missing last fall. The jewels were stolen and replaced with fakes, but detectives have not been able to solve the case. They are now calling in a data scientist to help parse their clues.
Unfortunately, the date and time of the jewel heist is not known. You have been hired to crack the case. Use the clues below to discover the thief’s identity.
Submit the name of the thief to the Canvas Quiz.
Lab 5: Murder in SQL City
To do…
PA 5.2: Jewel Heist – due Saturday, 5/4 at 11:59pm.
Lab 5: Murder in SQL City – due Saturday, 5/4 at 11:59pm.
Read Chapter 6: Version Control
- Check-ins 6.1 + 6.2 due Monday 5/6, at 10:00am
Comments from Week 4
When describing data, include context as well as the data characteristics.